SIMULTANEOUS PREDICTION OF INDEPENDENT POISSON OBSERVABLES

By Fumiyasu Komaki 

University of Tokyo 

Simultaneous predictive distributions for independent Poisson
observables are investigated. A class of improper prior distributions
for Poisson means is introduced. The Bayesian predictive distributions
based on priors from the introduced class are shown to be
admissible under the Kullback-Leibler loss. A Bayesian predictive
distribution based on a prior in this class dominates the Bayesian
predictive distribution based on the Jeffreys prior.

1. Introduction. Suppose that we have independent observations $x(1), x(2), \ldots, x(n)$, where $x(i) = (x_1(i), x_2(i), \ldots, x_d(i))$ $(i \in \{1, 2, \ldots, n\})$ is a set of $d$ independent Poisson random variables with unknown mean parameters $\lambda_1, \lambda_2, \ldots, \lambda_d$. We write $x^{(n)} = (x(1), x(2), \ldots, x(n))$ and $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_d)$. An unobserved set $x_{(m)} = (x(n+1), x(n+2), \ldots, x(n+m))$ from the same distribution is predicted by using a predictive distribution $\hat{p}(x_{(m)}; x^{(n)})$. We adopt the Kullback-Leibler divergence from the true distribution to a predictive distribution,

$$D(p(x_{(m)}|\lambda), \hat{p}(x_{(m)}; x^{(n)})) = \sum_{x_{(m)}} p(x_{(m)}|\lambda) \log \frac{p(x_{(m)}|\lambda)}{\hat{p}(x_{(m)}; x^{(n)})}, \tag{1}$$

which has a natural information theoretic meaning, as a loss function.

By sufficiency reduction, it suffices to consider the problem of predicting $y = (y_1, y_2, \ldots, y_d)$ using $x = (x_1, x_2, \ldots, x_d)$, where

$$x = \sum_{i=1}^{n} x(i) = \Bigl(\sum_{i=1}^{n} x_1(i), \sum_{i=1}^{n} x_2(i), \ldots, \sum_{i=1}^{n} x_d(i)\Bigr),$$

$$y = \sum_{j=1}^{m} x(n+j) = \Bigl(\sum_{j=1}^{m} x_1(n+j), \sum_{j=1}^{m} x_2(n+j), \ldots, \sum_{j=1}^{m} x_d(n+j)\Bigr),$$

under the loss

$$D(p(y|\lambda), \hat{p}(y; x)) = \sum_{y} p(y|\lambda) \log \frac{p(y|\lambda)}{\hat{p}(y; x)}. \tag{2}$$

In the following, we assume that $x = (x_1, x_2, \ldots, x_d)$ and $y = (y_1, y_2, \ldots, y_d)$ are distributed according to

$$p(x|\lambda) = \prod_{i=1}^{d} p(x_i|\lambda) = \exp\{-(a\lambda_1 + a\lambda_2 + \cdots + a\lambda_d)\} \frac{(a\lambda_1)^{x_1} (a\lambda_2)^{x_2} \cdots (a\lambda_d)^{x_d}}{x_1!\, x_2! \cdots x_d!}$$

and

$$p(y|\lambda) = \prod_{i=1}^{d} p(y_i|\lambda) = \exp\{-(b\lambda_1 + b\lambda_2 + \cdots + b\lambda_d)\} \frac{(b\lambda_1)^{y_1} (b\lambda_2)^{y_2} \cdots (b\lambda_d)^{y_d}}{y_1!\, y_2! \cdots y_d!},$$

respectively, and that $a$ and $b$ are positive real numbers.

There exist many studies that recommend using Bayesian predictive densities of the form

$$p_\pi(x_{(m)}|x^{(n)}) = \frac{\int p(x_{(m)}|\theta)\, p(x^{(n)}|\theta)\, \pi(\theta)\, d\theta}{\int p(x^{(n)}|\theta)\, \pi(\theta)\, d\theta}$$

rather than plug-in densities of the form $p(x_{(m)}|\hat\theta)$, where $\{p(x|\theta) \mid \theta \in \Theta\}$ is a parametric model, $\pi(\theta)$ is a prior and $\hat\theta$ is an estimate of $\theta$; see Aitchison and Dunsmore (1975) and Geisser (1993).

When we use a Bayesian procedure, the choice of a prior distribution is 
an important problem. Noninformative prior distributions or vague prior 
distributions are often used to construct Bayesian predictive distributions. 
The Jeffreys prior naturally arises from various discussions based on the 
Kullback-Leibler divergence [see Hartigan (1965), Akaike (1978), Bernardo 
(1979) and Clarke and Barron (1994)]. However, Bayesian methods based on 
the Jeffreys prior do not always perform satisfactorily, especially in problems 
with multidimensional parameters [see, e.g., Jeffreys (1961), page 182, and 
Berger and Bernardo (1989)]. 

Here, we investigate the use of shrinkage priors, which give more weight
to parameter values close to the origin than the Jeffreys prior does, for
constructing predictive distributions dominating the predictive distribution
based on the Jeffreys prior. If we adopt a plug-in distribution $p(y|\hat\lambda(x))$ as
a predictive distribution, the loss (1) for the plug-in distribution can be
regarded as a loss for the estimator $\hat\lambda$. Thus, predictive distribution theory
is a natural generalization of estimation theory under the Kullback-Leibler
loss.

Since Stein (1956) showed that the maximum likelihood estimator for the
mean vector of the $d$-dimensional Normal model $N_d(\mu, I)$ is not admissible
when $d \ge 3$ and James and Stein (1961) introduced an estimator dominating
the maximum likelihood estimator, numerous studies have been done on
shrinkage methods for parameter estimation.

For the means of $d$ independent Poisson distributions, Clevenson and
Zidek (1975) proposed a class of estimators dominating the maximum likelihood
estimator when $d \ge 2$ under the normalized squared loss $\sum_i (\hat\lambda_i - \lambda_i)^2 / \lambda_i$.
Many studies on simultaneous estimation of Poisson means have been done
under the loss function $\sum_i (\hat\lambda_i - \lambda_i)^2 / \lambda_i^m$, where $m$ is a nonnegative integer.

Ghosh and Yang (1988) characterized linear admissible estimators of the
form $\hat\lambda_i = c_i x_i + b_i$ under the Kullback-Leibler loss $D(p(y|\lambda), p(y|\hat\lambda(x)))$.
There are relatively few studies of estimation under the Kullback-Leibler
loss compared with the number of studies based on other loss functions
such as squared-error. What is called Stein's loss is the Kullback-Leibler
divergence with the direction opposite to our setting (1).

In contrast to the large number of studies on parameter estimation, little
attention has been given to decision theory of predictive distributions except
for some studies on group models [Murray (1977) and Ng (1980)] and some
recent work from an asymptotic viewpoint [Vidoni (1995), Komaki (1996)
and Haussler and Opper (1997)]. In particular, it seems that no studies have
been done on the admissibility of predictive distributions. Recently, however,
Komaki (2001) considered the $d$-dimensional Normal model $N_d(\mu, I)$,
$d \ge 3$, and showed that the Bayesian predictive distribution based on Stein's
harmonic prior $\pi_S(\mu) \propto \|\mu\|^{-(d-2)}$ [Stein (1974)] incorporates the advantage
of shrinkage methods and dominates the Bayesian predictive distribution
based on the Lebesgue prior $\pi_I(\mu) \propto 1$, which is the best predictive distribution
invariant under the translation group. Since many statistical problems
are naturally formulated as prediction problems [Aitchison and Dunsmore
(1975) and Geisser (1993)], this kind of approach seems to be useful for many
problems, and further decision theoretic studies, especially on admissibility,
are required.

In Section 2 we introduce a class of improper prior densities for Poisson
means and show that the predictive distributions based on the proposed
priors are admissible under the Kullback-Leibler loss. In Section 3 we show
that a Bayesian predictive distribution based on a prior $\pi_S(\lambda)$ in the
introduced class dominates the Bayesian predictive distribution based on the
Jeffreys prior, and that the plug-in distribution with the generalized Bayes
estimator based on $\pi_S(\lambda)$ is inadmissible under the Kullback-Leibler loss. In
Section 4 we discuss the relation between the main results here and several
previous studies on Bayesian theory from asymptotic viewpoints.

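2. A class of admissible predictive distributions. We introduce a class
of improper prior densities,

$$\pi_{\alpha,\beta}(\lambda)\, d\lambda_1\, d\lambda_2 \cdots d\lambda_d \propto \frac{\lambda_1^{\beta_1 - 1} \lambda_2^{\beta_2 - 1} \cdots \lambda_d^{\beta_d - 1}}{(\lambda_1 + \lambda_2 + \cdots + \lambda_d)^{\alpha}}\, d\lambda_1\, d\lambda_2 \cdots d\lambda_d \tag{3}$$

with $0 < -\alpha + \sum_i \beta_i \le 1$ and $\beta_i > 0$, $i = 1, 2, \ldots, d$.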

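Theorem 1. The Bayesian predictive distribution based on the prior

$$\pi_{\alpha,\beta}(\lambda)\, d\lambda_1\, d\lambda_2 \cdots d\lambda_d \propto \frac{\lambda_1^{\beta_1 - 1} \lambda_2^{\beta_2 - 1} \cdots \lambda_d^{\beta_d - 1}}{(\lambda_1 + \lambda_2 + \cdots + \lambda_d)^{\alpha}}\, d\lambda_1\, d\lambda_2 \cdots d\lambda_d$$

with $-\alpha + \sum_i \beta_i > 0$ and $\beta_i > 0$, $i = 1, 2, \ldots, d$, is given by

$$p_{\pi_{\alpha,\beta}}(y|x) = \Bigl(\frac{a}{a+b}\Bigr)^{\sum_i x_i - \alpha + \sum_i \beta_i} \Bigl(\frac{b}{a+b}\Bigr)^{\sum_i y_i} \frac{\Gamma(\sum_i x_i + \sum_i y_i - \alpha + \sum_i \beta_i)\, \Gamma(\sum_i x_i + \sum_i \beta_i)}{\Gamma(\sum_i x_i - \alpha + \sum_i \beta_i)\, \Gamma(\sum_i x_i + \sum_i y_i + \sum_i \beta_i)}$$
$$\times \frac{\Gamma(x_1 + y_1 + \beta_1)\, \Gamma(x_2 + y_2 + \beta_2) \cdots \Gamma(x_d + y_d + \beta_d)}{\Gamma(x_1 + \beta_1)\, \Gamma(x_2 + \beta_2) \cdots \Gamma(x_d + \beta_d)\; y_1!\, y_2! \cdots y_d!}.$$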

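Proof. By using Lemma 1 below, we have

$$p_{\pi_{\alpha,\beta}}(y|x) = \frac{\int \pi_{\alpha,\beta}(\lambda) \prod_{i=1}^{d} \{\exp(-a\lambda_i)(a\lambda_i)^{x_i}/x_i!\} \prod_{j=1}^{d} \{\exp(-b\lambda_j)(b\lambda_j)^{y_j}/y_j!\}\, d\lambda}{\int \pi_{\alpha,\beta}(\lambda) \prod_{k=1}^{d} \{\exp(-a\lambda_k)(a\lambda_k)^{x_k}/x_k!\}\, d\lambda}$$

$$= \frac{\int \pi_{\alpha,\beta}(\lambda) \prod_{i=1}^{d} [\exp\{-(a+b)\lambda_i\} \{(a+b)\lambda_i\}^{x_i + y_i}]\, d\lambda}{\int \pi_{\alpha,\beta}(\lambda) \prod_{k=1}^{d} \{\exp(-a\lambda_k)(a\lambda_k)^{x_k}\}\, d\lambda} \times \prod_{j=1}^{d} \frac{a^{x_j} b^{y_j}}{(a+b)^{x_j + y_j}\, y_j!}$$

$$= \frac{a^{\sum_i x_i}\, b^{\sum_i y_i}}{(a+b)^{\sum_i x_i + \sum_i y_i}} \times (a+b)^{\alpha - \sum_i \beta_i}\, \frac{\Gamma(\sum_i x_i + \sum_i y_i - \alpha + \sum_i \beta_i)}{\Gamma(\sum_i x_i + \sum_i y_i + \sum_i \beta_i)}\, \Gamma(x_1 + y_1 + \beta_1) \cdots \Gamma(x_d + y_d + \beta_d)$$
$$\times \Bigl[ a^{\alpha - \sum_i \beta_i}\, \frac{\Gamma(\sum_i x_i - \alpha + \sum_i \beta_i)}{\Gamma(\sum_i x_i + \sum_i \beta_i)}\, \Gamma(x_1 + \beta_1) \cdots \Gamma(x_d + \beta_d)\; y_1!\, y_2! \cdots y_d! \Bigr]^{-1}.$$

Thus we obtain the desired result. □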
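
The closed form in Theorem 1 is straightforward to evaluate on the log scale. The following is a minimal sketch, not part of the original paper; the function names `log_predictive` and `total_mass` are hypothetical, and the normalization check relies on truncating the infinite sum over $y$ to a finite grid, which is an assumption that is harmless only when $b/(a+b)$ is not close to 1.

```python
import math
from itertools import product


def log_predictive(y, x, a, b, alpha, beta):
    """Log of the Bayesian predictive probability of Theorem 1 (hypothetical helper)."""
    sx, sy, sb = sum(x), sum(y), sum(beta)
    if sb - alpha <= 0 or min(beta) <= 0:
        raise ValueError("need -alpha + sum(beta) > 0 and every beta_i > 0")
    # Powers of a/(a+b) and b/(a+b).
    out = (sx - alpha + sb) * math.log(a / (a + b)) + sy * math.log(b / (a + b))
    # Ratio of gamma functions that depends only on the totals.
    out += math.lgamma(sx + sy - alpha + sb) + math.lgamma(sx + sb)
    out -= math.lgamma(sx - alpha + sb) + math.lgamma(sx + sy + sb)
    # Coordinatewise gamma ratios and the factorials y_i!.
    for xi, yi, bi in zip(x, y, beta):
        out += math.lgamma(xi + yi + bi) - math.lgamma(xi + bi) - math.lgamma(yi + 1)
    return out


def total_mass(x, a, b, alpha, beta, y_max):
    """Sum of predictive probabilities over the truncated grid {0, ..., y_max}^d."""
    grid = product(range(y_max + 1), repeat=len(x))
    return sum(math.exp(log_predictive(y, x, a, b, alpha, beta)) for y in grid)
```

Summing over a sufficiently large grid should return a value numerically indistinguishable from 1, which is a quick sanity check on the gamma-function bookkeeping.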

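Lemma 1. When $-\alpha + \sum_i \beta_i > 0$ and $\beta_i > 0$, $i = 1, 2, \ldots, d$, we have that

$$\int \frac{\lambda_1^{\beta_1 - 1} \lambda_2^{\beta_2 - 1} \cdots \lambda_d^{\beta_d - 1}}{(\lambda_1 + \lambda_2 + \cdots + \lambda_d)^{\alpha}} \prod_{i=1}^{d} \{\exp(-a\lambda_i)(a\lambda_i)^{x_i}\}\, d\lambda_1\, d\lambda_2 \cdots d\lambda_d$$
$$= a^{\alpha - \sum_i \beta_i}\, \frac{\Gamma(\sum_i x_i - \alpha + \sum_i \beta_i)}{\Gamma(\sum_i x_i + \sum_i \beta_i)}\, \Gamma(x_1 + \beta_1)\, \Gamma(x_2 + \beta_2) \cdots \Gamma(x_d + \beta_d).$$

The proof of Lemma 1 is given in the Appendix.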

Let $\mathcal{P}$ be the class of predictive distributions that have finite risk for all
values of $\lambda$. For example, the plug-in distribution

$$p(y|\hat\lambda(x)) = \prod_{i=1}^{d} \exp\Bigl(-\frac{b x_i}{a}\Bigr) \frac{(b x_i / a)^{y_i}}{y_i!}$$

with the maximum likelihood estimator $\hat\lambda(x) = x/a$ is not included in $\mathcal{P}$,
because the loss (2) becomes infinite when $\lambda_i > 0$ and $\hat\lambda_i(x) = 0$. If a predictive
distribution is admissible in $\mathcal{P}$, then it is admissible in the class of all
predictive distributions.
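
The reason the maximum likelihood plug-in falls outside the finite-risk class can be seen from the closed form of the Kullback-Leibler divergence between two Poisson vectors. The sketch below is illustrative only; the helper names `poisson_kl` and `plugin_mle_loss` are hypothetical and do not come from the paper.

```python
import math


def poisson_kl(true_means, predicted_means):
    """KL divergence from independent Poissons with true_means to Poissons with predicted_means."""
    total = 0.0
    for m, q in zip(true_means, predicted_means):
        if m == 0.0:
            total += q              # the true variable is degenerate at 0
        elif q == 0.0:
            return math.inf         # predicted mass is concentrated on 0, true mass is not
        else:
            total += q - m + m * math.log(m / q)
    return total


def plugin_mle_loss(lam, x, a, b):
    """Loss (2) of the plug-in distribution with the estimate x / a."""
    return poisson_kl([b * l for l in lam], [b * xi / a for xi in x])
```

Because every count $x_i$ equals zero with positive probability, the infinite loss occurs on an event of positive probability, so the risk of this plug-in rule is infinite whenever some $\lambda_i > 0$.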

Before proving the admissibility of the proposed class of Bayesian predictive
distributions, we establish the following theorem.

Theorem 2. If $\hat{p}(y; x) \in \mathcal{P}$, then the risk function

$$r_{\hat{p}}(\lambda) = E[D(p(y|\lambda), \hat{p}(y; x)) \mid \lambda]$$

is a continuous function of $\lambda$.



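Proof. The risk function is given by

$$\sum_{x} p(x|\lambda) \sum_{y} p(y|\lambda) \log \frac{p(y|\lambda)}{\hat{p}(y; x)} = \sum_{y} p(y|\lambda) \log p(y|\lambda) + \sum_{x} p(x|\lambda) \sum_{y} p(y|\lambda) \{-\log \hat{p}(y; x)\}. \tag{4}$$

The first term on the right-hand side of (4) is

$$\sum_{y} p(y|\lambda) \log p(y|\lambda) = \sum_{i=1}^{d} \sum_{y_i = 0}^{\infty} \exp(-b\lambda_i) \frac{(b\lambda_i)^{y_i}}{y_i!} \{-b\lambda_i + y_i \log(b\lambda_i) - \log y_i!\}.$$

This is finite for all values of $\lambda$ and is a continuous function of $\lambda$. The second
term on the right-hand side of (4) is

$$\sum_{x} p(x|\lambda) \sum_{y} p(y|\lambda) \{-\log \hat{p}(y; x)\} = \sum_{x} \sum_{y} \prod_{i=1}^{d} \exp(-a\lambda_i) \frac{(a\lambda_i)^{x_i}}{x_i!} \prod_{j=1}^{d} \exp(-b\lambda_j) \frac{(b\lambda_j)^{y_j}}{y_j!} \{-\log \hat{p}(y; x)\}$$

$$= \exp\Bigl\{-(a+b) \sum_i \lambda_i\Bigr\} \sum_{x} \sum_{y} \frac{a^{\sum_i x_i}\, b^{\sum_i y_i}}{x_1!\, x_2! \cdots x_d!\; y_1!\, y_2! \cdots y_d!} \{-\log \hat{p}(y; x)\}\, \lambda_1^{x_1 + y_1} \lambda_2^{x_2 + y_2} \cdots \lambda_d^{x_d + y_d}. \tag{5}$$

If $\hat{p}(y; x) \in \mathcal{P}$, the power series in $\lambda_1, \lambda_2, \ldots, \lambda_d$,

$$\sum_{x} \sum_{y} \frac{a^{\sum_i x_i}\, b^{\sum_i y_i}}{x_1!\, x_2! \cdots x_d!\; y_1!\, y_2! \cdots y_d!} \{-\log \hat{p}(y; x)\}\, \lambda_1^{x_1 + y_1} \lambda_2^{x_2 + y_2} \cdots \lambda_d^{x_d + y_d},$$

converges absolutely for all $\lambda \in \mathbb{R}^d$. Thus, (5) is a continuous function of $\lambda$.
Therefore, the risk function is continuous for all values of $\lambda$ if $\hat{p}(y; x) \in \mathcal{P}$. □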



Theorem 3. For every $d \ge 1$, the Bayesian predictive distributions based
on the priors in the class $\{\pi_{\alpha,\beta}(\lambda) : 0 < -\alpha + \sum_{i=1}^{d} \beta_i \le 1,\ \beta_i > 0,\ i = 1, 2, \ldots, d\}$
defined by (3) are admissible under the Kullback-Leibler loss.

The proof of Theorem 3 is given in the Appendix.

3. A shrinkage prior dominating the Jeffreys prior. In this section, we show
that the Bayesian predictive distribution based on the Jeffreys prior is
inadmissible and give an explicit form of a shrinkage predictive distribution
dominating the Bayesian predictive distribution based on the Jeffreys prior.

First, we show that the Bayesian predictive distribution $p_{\pi_{\alpha,\beta}}(y|x)$ is
inadmissible when $-\alpha + \sum_i \beta_i > 1$.

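Theorem 4. When $-\alpha + \sum_i \beta_i > 1$ and $\beta_i > 0$, $i = 1, 2, \ldots, d$, the Bayesian
predictive distribution $p_{\pi_{\alpha,\beta}}(y|x)$ based on $\pi_{\alpha,\beta}(\lambda)$ is dominated by the Bayesian
predictive distribution $p_{\pi_{\tilde\alpha,\tilde\beta}}(y|x)$ based on $\pi_{\tilde\alpha,\tilde\beta}(\lambda)$, where $\tilde\alpha := \sum_i \beta_i - 1$ and
$(\tilde\beta_1, \tilde\beta_2, \ldots, \tilde\beta_d) := (\beta_1, \beta_2, \ldots, \beta_d)$.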

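Proof. From Theorem 1, we have

$$E[D(p(y|\lambda), p_{\pi_{\alpha,\beta}}(y|x)) \mid \lambda] - E[D(p(y|\lambda), p_{\pi_{\tilde\alpha,\tilde\beta}}(y|x)) \mid \lambda] = E\Bigl[\log \frac{p_{\pi_{\tilde\alpha,\tilde\beta}}(y|x)}{p_{\pi_{\alpha,\beta}}(y|x)} \Bigm| \lambda\Bigr]$$

$$= E\Bigl[\log \Bigl\{ \Bigl(\frac{a}{a+b}\Bigr)^{-\tilde\alpha + \alpha} \frac{\Gamma(\sum_i x_i + \sum_i y_i - \tilde\alpha + \sum_i \beta_i)\, \Gamma(\sum_i x_i - \alpha + \sum_i \beta_i)}{\Gamma(\sum_i x_i - \tilde\alpha + \sum_i \beta_i)\, \Gamma(\sum_i x_i + \sum_i y_i - \alpha + \sum_i \beta_i)} \Bigr\} \Bigm| \lambda\Bigr]$$

$$= E\Bigl[\log \Gamma\Bigl(\sum_i x_i - \alpha + \sum_i \beta_i\Bigr) - \log \Gamma\Bigl(\sum_i x_i - \tilde\alpha + \sum_i \beta_i\Bigr) - (\tilde\alpha - \alpha) \log a \Bigm| \lambda\Bigr]$$
$$- E\Bigl[\log \Gamma\Bigl(\sum_i x_i + \sum_i y_i - \alpha + \sum_i \beta_i\Bigr) - \log \Gamma\Bigl(\sum_i x_i + \sum_i y_i - \tilde\alpha + \sum_i \beta_i\Bigr) - (\tilde\alpha - \alpha) \log(a+b) \Bigm| \lambda\Bigr] \tag{6}$$

$$= E\Bigl[\log \Gamma\Bigl(\sum_i x_i + 1 + \tilde\alpha - \alpha\Bigr) - \log \Gamma\Bigl(\sum_i x_i + 1\Bigr) - (\tilde\alpha - \alpha) \log a \Bigm| \lambda\Bigr]$$
$$- E\Bigl[\log \Gamma\Bigl(\sum_i x_i + \sum_i y_i + 1 + \tilde\alpha - \alpha\Bigr) - \log \Gamma\Bigl(\sum_i x_i + \sum_i y_i + 1\Bigr) - (\tilde\alpha - \alpha) \log(a+b) \Bigm| \lambda\Bigr]. \tag{7}$$

When $\mu := \sum_i \lambda_i = 0$,

$$E[D(p(y|\lambda), p_{\pi_{\alpha,\beta}}(y|x)) \mid \lambda] - E[D(p(y|\lambda), p_{\pi_{\tilde\alpha,\tilde\beta}}(y|x)) \mid \lambda] = (\tilde\alpha - \alpha) \log \frac{a+b}{a} > 0.$$

When $\mu > 0$, by using Lemma 2 below, we have

$$E[D(p(y|\lambda), p_{\pi_{\alpha,\beta}}(y|x)) \mid \lambda] - E[D(p(y|\lambda), p_{\pi_{\tilde\alpha,\tilde\beta}}(y|x)) \mid \lambda] > 0,$$

since $\sum_i (x_i + y_i)$ and $\sum_i x_i$ are Poisson random variables with parameters
$(a+b) \sum_i \lambda_i$ and $a \sum_i \lambda_i$, respectively. □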

Lemma 2. Let $X$ be a Poisson random variable with mean $\mu$. Then

$$E[\log \Gamma(X + 1 + c) - \log \Gamma(X + 1) - c \log \mu \mid \mu],$$

where $c$ is a positive constant, is a strictly decreasing function of $\mu > 0$.

The proof of Lemma 2 is given in the Appendix. 
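
The monotonicity claimed in Lemma 2 can be checked numerically. The following sketch is an illustration only; `poisson_expectation` and `lemma2_value` are hypothetical helper names, and the Poisson sum is truncated at a point well beyond the bulk of the distribution, which is an approximation rather than an exact evaluation.

```python
import math


def poisson_expectation(f, mean):
    """Approximate E[f(X)] for X ~ Poisson(mean) by a truncated sum."""
    # Truncation point: far into the upper tail for small and moderate means.
    k_max = int(mean + 12.0 * math.sqrt(mean) + 40.0)
    pmf = math.exp(-mean)
    total = 0.0
    for k in range(k_max + 1):
        total += pmf * f(k)
        pmf *= mean / (k + 1)       # recursion P(X = k + 1) = P(X = k) * mean / (k + 1)
    return total


def lemma2_value(mu, c):
    """E[log Gamma(X + 1 + c) - log Gamma(X + 1)] - c * log(mu), X ~ Poisson(mu)."""
    gap = lambda k: math.lgamma(k + 1 + c) - math.lgamma(k + 1)
    return poisson_expectation(gap, mu) - c * math.log(mu)
```

Evaluating `lemma2_value` on an increasing grid of `mu` for a fixed `c > 0` should produce a strictly decreasing sequence, in line with the lemma.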

In the following, we set $\pi_S(\lambda) = \pi_{\alpha = d/2 - 1,\ \beta = (1/2, \ldots, 1/2)}(\lambda)$.

Corollary 1. When $d \ge 3$, the Bayesian predictive distribution $p_{\pi_S}(y|x)$
based on the prior $\pi_S(\lambda)$ dominates the Bayesian predictive distribution
$p_{\pi_J}(y|x)$ based on the Jeffreys prior

$$\pi_J(\lambda)\, d\lambda_1\, d\lambda_2 \cdots d\lambda_d \propto \frac{1}{(\lambda_1 \lambda_2 \cdots \lambda_d)^{1/2}}\, d\lambda_1\, d\lambda_2 \cdots d\lambda_d.$$

Proof. The Jeffreys prior is equal to $\pi_{\alpha = 0,\ \beta = (1/2, \ldots, 1/2)}$. The desired
results follow from Theorem 4 because $-\alpha + \sum_i \beta_i = d/2 > 1$. □

Figure 1 shows the difference between the risk of $p_{\pi_J}(y|x)$ and that of
$p_{\pi_S}(y|x)$.

Since $p_{\pi_S}(y|x)$, based on the prior $\pi_S(\lambda)$, dominates $p_{\pi_J}(y|x)$, based on
the Jeffreys prior $\pi_J(\lambda)$, it seems to be reasonable to adopt $\pi_S(\lambda)$ as a default
prior instead of $\pi_J(\lambda)$.

Fig. 1. The difference between the expected divergences, $E[D(p(y|\lambda), p_{\pi_J}(y|x)) \mid \lambda] - E[D(p(y|\lambda), p_{\pi_S}(y|x)) \mid \lambda]$, which depends on $\lambda$ only through $\mu = \lambda_1 + \lambda_2 + \cdots + \lambda_d$,
for d = 3, 5, 8, 12.
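
The risk difference between the Jeffreys-prior and shrinkage-prior predictive distributions can be evaluated from expression (7) with $\tilde\alpha - \alpha = d/2 - 1$, since the sums $\sum_i x_i$ and $\sum_i (x_i + y_i)$ are Poisson with means $a\mu$ and $(a+b)\mu$. The sketch below is a hedged illustration: the function names are hypothetical, the values of `a` and `b` behind the paper's figure are not stated in the text and must be supplied by the user, and the Poisson expectations are truncated sums.

```python
import math


def poisson_expectation(f, mean):
    """Approximate E[f(X)] for X ~ Poisson(mean) by a truncated sum."""
    k_max = int(mean + 12.0 * math.sqrt(mean) + 40.0)
    pmf = math.exp(-mean)
    total = 0.0
    for k in range(k_max + 1):
        total += pmf * f(k)
        pmf *= mean / (k + 1)
    return total


def gamma_gap_mean(mean, c):
    """E[log Gamma(X + 1 + c) - log Gamma(X + 1)] for X ~ Poisson(mean)."""
    return poisson_expectation(
        lambda k: math.lgamma(k + 1 + c) - math.lgamma(k + 1), mean)


def risk_difference(mu, d, a, b):
    """Risk under the Jeffreys prior minus risk under the shrinkage prior, via (7)."""
    c = d / 2.0 - 1.0                       # tilde(alpha) - alpha for these two priors
    if mu == 0.0:
        return c * math.log((a + b) / a)    # boundary case mu = 0
    first = gamma_gap_mean(a * mu, c) - c * math.log(a)
    second = gamma_gap_mean((a + b) * mu, c) - c * math.log(a + b)
    return first - second
```

For example, `[risk_difference(mu, d, 1.0, 1.0) for d in (3, 5, 8, 12)]` gives the four curves at a single value of `mu` under the assumed choice `a = b = 1`; all values should be positive when `d >= 3`.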

When we adopt a prior distribution $\pi(\lambda)$, the plug-in distribution $p(y|\hat\lambda(x))$,
where $\hat\lambda(x)$ is the generalized Bayes estimator based on $\pi(\lambda)$, is often used
for prediction.

Theorem 5. The plug-in distribution $p(y|\hat\lambda(x))$ with the generalized
Bayes estimator $\hat\lambda(x)$ based on $\pi_S(\lambda)$ is inadmissible under the Kullback-Leibler
loss.

The proof of Theorem 5 is given in the Appendix.

It can be shown that the plug-in distribution $p(y|\hat\lambda)$ with the generalized
Bayes estimator $\hat\lambda$ based on $\pi_S(\lambda)$ is admissible in the class of all plug-in
distributions. However, it is inadmissible in the class of all predictive
distributions. Therefore, it is not reasonable to restrict the class of predictive
distributions to plug-in distributions.

4. Some asymptotic properties and discussion. In this section, we discuss
the relation between the results in the previous sections and several previous
studies on Bayesian theory from asymptotic viewpoints. Suppose that
$x(1), x(2), \ldots, x(n), x(n+1), \ldots, x(n+m)$ are independent random variables
from a true density $p(x|\theta)$ that belongs to a statistical model $\{p(x|\theta) \mid \theta \in \Theta\}$.
The dimension of the parameter space $\Theta$ is $d$. Let $x^{(n)} = (x(1), x(2), \ldots, x(n))$
and $x_{(m)} = (x(n+1), x(n+2), \ldots, x(n+m))$. The objective is to construct
a good Bayesian predictive distribution $p_\pi(x_{(m)}|x^{(n)})$ based on a prior $\pi$.

When $x(1), x(2), \ldots, x(n), x(n+1), \ldots, x(n+m)$ are independent sets of $d$
independent Poisson random variables with mean parameters $\lambda_1, \ldots, \lambda_d$, we
consider a slight generalization of the problem introduced in Section 1. The
objective is to predict $y$ that is a set of $d$ independent Poisson random
variables with mean parameters $b\lambda_1, b\lambda_2, \ldots, b\lambda_d$, $b > 0$, by using an observation
$x$ that is a set of $d$ independent Poisson random variables with mean
parameters $a\lambda_1, a\lambda_2, \ldots, a\lambda_d$, $a > 0$. Here, $a$ and $b$ correspond to $n$ and $m$,
respectively.



4.1. Some asymptotics. First, we consider the asymptotics where $a$ and
$d$ are fixed and $b$ goes to infinity. In this subsection $d \ge 3$ is assumed. The
asymptotics are closely related to the setup where $n = 0$ and $m \to \infty$, which
has been studied in reference analysis, coding theory and prequential analysis
as we will see in the next subsection.

When $\mu := \sum_i \lambda_i = 0$, we have

$$E[D(p(y|\lambda), p_{\pi_J}(y|x)) \mid \lambda] - E[D(p(y|\lambda), p_{\pi_S}(y|x)) \mid \lambda] = \Bigl(\frac{d}{2} - 1\Bigr) \log \frac{a+b}{a} > 0$$

from (7). Thus, the risk difference between the Bayesian predictive distribution
based on the Jeffreys prior $\pi_J(\lambda)$ and that based on the shrinkage prior
$\pi_S(\lambda)$ is of order $\log b$.

When $\mu \ne 0$, the risk difference converges to a positive constant when
$a$ and $d$ are fixed and $b$ goes to infinity. By evaluating (6) using Stirling's
formula $\log \Gamma(x) = \log (2\pi)^{1/2} + (x - 1/2) \log x - x + o(1)$, it can be easily
verified that

$$\lim_{b \to \infty} \bigl(E[D(p(y|\lambda), p_{\pi_J}(y|x)) \mid \lambda] - E[D(p(y|\lambda), p_{\pi_S}(y|x)) \mid \lambda]\bigr)$$
$$= E\Bigl[\log \Gamma\Bigl(\sum_i x_i + \frac{d}{2}\Bigr) - \log \Gamma\Bigl(\sum_i x_i + 1\Bigr) - \Bigl(\frac{d}{2} - 1\Bigr) \log(a\mu) \Bigm| \lambda\Bigr] > 0.$$
Second, we consider the asymptotics where $b$ and $d$ are fixed and $a$ goes
to infinity. There are many statistical applications where the objective is
to construct a good predictive distribution for a future observation $x_{(m)}$ by
using the observed data $x^{(n)}$ and $n$ is relatively large. An important example
is one-step prediction, $b = m = 1$. Improper prior distributions are widely
used to construct Bayesian predictive distributions. Asymptotic properties
of predictive distributions for one-step prediction have been studied [Vidoni
(1995), Komaki (1996), Hartigan (1998) and Komaki (2002b)]. When we
consider the Poisson model, by a discussion similar to the previous studies,
it can be shown that the loss function for a Bayesian predictive distribution
can be expanded as

D(p(y\6),p n (y\x)) 

(8) 

d 



\Ygu(6)(0 l K - 6 % ) 2 + terms independent of tt + O p (a 2 ), 



2 

i=i 



where ^ = V /3 , 0% = {{k)i} 1/3 , 

Mi = : + -o )\d Xl log' 



oVl 1 Wj(A) 



Xi=Xi/a 



+ o p (a x ) 



and $a g_{ii}(\theta) = 9a\theta^i$ is the Fisher information. The risk difference between the
Bayesian predictive distribution based on the Jeffreys prior $\pi_J(\lambda)$ and that
based on the shrinkage prior $\pi_S(\lambda)$ is of order $a^{-2}$ when $a$ goes to infinity
[see Komaki (2002b) for details on the asymptotics of shrinkage predictive
distributions]. Equation (8) gives an intuitive meaning for the Kullback-Leibler
loss.

Third, we consider the asymptotics where $a$ and $b$ are fixed and $d$ goes
to infinity. The data dimension $d$ becomes large in many fields of applied
statistics such as spatial statistics, contingency table analysis and population
data analysis.

It is easy to show the following result by evaluating (6) using Stirling's
formula. If $\limsup_{d \to \infty} (\mu_d / d) < \infty$, where $\mu_d := \sum_{i=1}^{d} \lambda_i$, then

$$0 < \liminf_{d \to \infty} \frac{1}{d} \{E[D(p(y|\lambda), p_{\pi_J}(y|x)) \mid \lambda] - E[D(p(y|\lambda), p_{\pi_S}(y|x)) \mid \lambda]\}$$
$$\le \limsup_{d \to \infty} \frac{1}{d} \{E[D(p(y|\lambda), p_{\pi_J}(y|x)) \mid \lambda] - E[D(p(y|\lambda), p_{\pi_S}(y|x)) \mid \lambda]\} < \infty.$$

For example, when $\lambda_i$, $i = 1, 2, 3, \ldots$, are generated independently from a
distribution that has mean $\bar\lambda$, then $\lim_{d \to \infty} (\mu_d / d) = \bar\lambda$ almost surely and the
risk difference is of order $d$ as $d$ goes to infinity.

4.2. Relation to previous work. In coding theory, the ideal code-word
length of a Bayes code for a data string $x_{(m)} = (x(1), \ldots, x(m))$ based on a
proper prior density $\pi(\theta)$ is given by

$$-\log p_\pi(x_{(m)}) = -\log \int p(x(1), \ldots, x(m)|\theta)\, \pi(\theta)\, d\theta = -\sum_{i=0}^{m-1} \log \int p(x(i+1)|\theta)\, \pi(\theta|x^{(i)})\, d\theta, \tag{9}$$

where $n = 0$. The average of the expected redundancy with respect to $\pi(\theta)$
is given by

$$I_\pi(\theta; x_{(m)}) = \int \pi(\theta) \int p(x_{(m)}|\theta) \Bigl\{-\log \int p(x_{(m)}|\theta')\, \pi(\theta')\, d\theta' + \log p(x_{(m)}|\theta)\Bigr\}\, dx_{(m)}\, d\theta$$
$$= \int\!\!\int p(x_{(m)}|\theta)\, \pi(\theta) \log \frac{p(x_{(m)}|\theta)\, \pi(\theta)}{p_\pi(x_{(m)})\, \pi(\theta)}\, dx_{(m)}\, d\theta, \tag{10}$$

which is the mutual information between $\theta$ and $x_{(m)}$.

Bernardo (1979) introduced the notion of reference prior distributions
and showed that the Jeffreys prior asymptotically maximizes the mutual
information between $\theta$ and $x_{(m)} = (x(1), x(2), \ldots, x(m))$ when $m \to \infty$ by
using a heuristic discussion, although the mutual information cannot be
properly defined if $\pi(\theta)$ is improper. Prequential analysis [Dawid (1984) and
Skouras and Dawid (1999)] is also based on the logarithmic scoring rule used
to give code lengths.

In the discussions above, there exist serious technical difficulties associated
with infinite integrals when we consider improper prior distributions.
If $\pi(\theta)$ is improper, (9) cannot be regarded as an ideal code-word length of
a Bayes code. A compact subset or a sequence of compact subsets of the
original parameter space $\Theta$ has been considered to handle the difficulties
in many previous studies. The heuristics are artificial but useful for treating
the problems rigorously.

When $n = 0$ and $m$ goes to infinity, under suitable regularity conditions,
the mutual information between $x_{(m)}$ and $\theta$ is expanded as

$$I_\pi(\theta; x_{(m)}) = \frac{d}{2} \log \frac{m}{2\pi e} + \int_K \pi(\theta) \log |g(\theta)|^{1/2}\, d\theta - \int_K \pi(\theta) \log \pi(\theta)\, d\theta + o(1), \tag{11}$$

where $K$ is a compact subset of the original parameter space and $|g(\theta)|$
is the determinant of the Fisher information matrix [Ibragimov and Hasminskii
(1973) and Clarke and Barron (1994)]. Thus (11) is maximized
when $\pi(\theta) \propto |g(\theta)|^{1/2}$, which is the Jeffreys prior. The difference in $I_\pi(\theta; x_{(m)})$
due to the choice of a prior $\pi(\theta)$ is of order 1 when $m$ goes to infinity.

Here we consider the Poisson model and introduce an alternative method
to deal with the difficulties associated with improper priors. Suppose that
a transmitter A and a receiver B commonly observe a data sequence $x^{(n)} = (x(1), x(2), \ldots, x(n))$.
Only the transmitter A can observe the subsequent
data sequence $x_{(m)} = (x(n+1), \ldots, x(n+m))$. The transmitter A sends $x_{(m)}$
to the receiver B by using a Bayes code based on a prior $\pi(\lambda)$. Then the
ideal code-word length of the Bayes code for $x_{(m)}$ can properly be defined
by

$$-\log \int p(x_{(m)}|\lambda)\, \pi(\lambda|x^{(n)})\, d\lambda$$

if the posterior density $\pi(\lambda|x^{(n)})$ is a proper density. The Bayes risk

$$\int \pi(\lambda) \sum_{x^{(n)}} p(x^{(n)}|\lambda) \sum_{x_{(m)}} p(x_{(m)}|\lambda) \log \frac{p(x_{(m)}|\lambda)}{p_\pi(x_{(m)}|x^{(n)})}\, d\lambda$$

coincides with the mutual information (10) when $n = 0$.

Now we consider the slightly generalized Poisson model. When $a$ is close
to 0, the observation $x$ provides only a small amount of information and
the situation is close to the setup that has been studied in reference analysis
and Bayes coding theory, where the Jeffreys priors are recommended.
However, Corollary 1 in Section 3 shows that the Bayesian predictive distribution
based on the shrinkage prior $\pi_S(\lambda)$ has better performance than
that based on the Jeffreys prior $\pi_J(\lambda)$ even in such a situation, since the risk
function of the shrinkage prior is smaller than that of the Jeffreys prior for
all $a > 0$ and $b > 0$ [see also Komaki (2002a) for related discussion for group
models].

Note that our discussion is based on the original parameter space. It seems 
difficult to analyze the shrinkage phenomenon under the assumption that 
the real parameter value is in a compact subset of the original parameter 
space. 

Finally, we note that predictive distributions based on the Jeffreys prior 
seem to become inadmissible under many loss functions other than the 
Kullback-Leibler loss. The admissibility of the predictive distribution based 
on the Jeffreys prior and shrinkage predictive distributions under other loss 
functions requires further study, although the Kullback-Leibler divergence 
is a natural loss function in several important streams in Bayesian theory. 



APPENDIX 

Proofs of lemmas and Theorems 3 and 5. 



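Proof of Lemma 1. Let $\mu = \lambda_1 + \lambda_2 + \cdots + \lambda_d$ and $w_i = \lambda_i / \mu$, $i = 1, \ldots, d-1$.
Since the relation

$$dw_1 \cdots dw_{d-1}\, d\mu = \frac{1}{\mu^{d-1}}\, d\lambda_1 \cdots d\lambda_d$$

holds, we have

$$\int \Bigl(\sum_{i=1}^{d} \lambda_i\Bigr)^{-\alpha} \lambda_1^{\beta_1 - 1} \lambda_2^{\beta_2 - 1} \cdots \lambda_d^{\beta_d - 1} \prod_{i=1}^{d} \{\exp(-a\lambda_i)(a\lambda_i)^{x_i}\}\, d\lambda_1\, d\lambda_2 \cdots d\lambda_d$$

$$= \int_0^\infty \!\!\int \mu^{-\alpha + \sum_i \beta_i + \sum_i x_i - 1} \exp(-a\mu)\, a^{\sum_i x_i}\, w_1^{\beta_1 + x_1 - 1} \cdots w_{d-1}^{\beta_{d-1} + x_{d-1} - 1} \Bigl(1 - \sum_{i=1}^{d-1} w_i\Bigr)^{\beta_d + x_d - 1} dw_1\, dw_2 \cdots dw_{d-1}\, d\mu$$

$$= a^{\alpha - \sum_i \beta_i}\, \Gamma\Bigl(\sum_i x_i - \alpha + \sum_i \beta_i\Bigr)\, \frac{\Gamma(x_1 + \beta_1)\, \Gamma(x_2 + \beta_2) \cdots \Gamma(x_d + \beta_d)}{\Gamma(\sum_i x_i + \sum_i \beta_i)},$$

where the last equality evaluates the gamma integral in $\mu$ and the Dirichlet integral in $(w_1, \ldots, w_{d-1})$. □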

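Proof of Lemma 2. The derivative of $E[\log \Gamma(X + 1 + c) - \log \Gamma(X + 1) - c \log \mu \mid \mu]$
satisfies the following inequality:

$$\frac{\partial}{\partial \mu} E[\log \Gamma(X + 1 + c) - \log \Gamma(X + 1) - c \log \mu \mid \mu]$$

$$= -\sum_{k=0}^{\infty} \exp(-\mu) \frac{\mu^k}{k!} \log \Gamma(k + 1 + c) + \sum_{k=1}^{\infty} \exp(-\mu) \frac{\mu^{k-1}}{(k-1)!} \log \Gamma(k + 1 + c)$$
$$+ \sum_{k=0}^{\infty} \exp(-\mu) \frac{\mu^k}{k!} \log \Gamma(k + 1) - \sum_{k=1}^{\infty} \exp(-\mu) \frac{\mu^{k-1}}{(k-1)!} \log \Gamma(k + 1) - \frac{c}{\mu}$$

$$= \sum_{k=0}^{\infty} \exp(-\mu) \frac{\mu^k}{k!} \{\log(k + 1 + c) - \log(k + 1)\} - \frac{c}{\mu}$$

$$< \sum_{k=0}^{\infty} \exp(-\mu) \frac{\mu^k}{k!} \frac{c}{k + 1} - \frac{c}{\mu} = \frac{c}{\mu} \sum_{k=0}^{\infty} \exp(-\mu) \frac{\mu^{k+1}}{(k+1)!} - \frac{c}{\mu} = -\frac{c}{\mu} \exp(-\mu) < 0.$$

We have thus proved the desired result. □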

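Proof of Theorem 3. The admissibility is proved by using Blyth's
method [Blyth (1951)]. For convenience, we put

$$\pi_{\alpha,\beta}(\lambda)\, d\lambda_1\, d\lambda_2 \cdots d\lambda_d = \frac{\lambda_1^{\beta_1 - 1} \lambda_2^{\beta_2 - 1} \cdots \lambda_d^{\beta_d - 1}}{(\lambda_1 + \lambda_2 + \cdots + \lambda_d)^{\alpha}}\, d\lambda_1\, d\lambda_2 \cdots d\lambda_d$$

in this proof. We use a sequence of priors $\{\pi^{[l]}_{\alpha,\beta}(\lambda) = \pi_{\alpha,\beta}(\lambda)\, h_l^2(\mu)\}$, where $\mu = \lambda_1 + \cdots + \lambda_d$ and
$\{h_l\}$ is a sequence of functions defined by

$$h_l(\mu) = \begin{cases} 1, & \text{if } 0 \le \mu \le 1, \\ 1 - \dfrac{\log \mu}{\log l}, & \text{if } 1 < \mu \le l, \\ 0, & \text{if } l < \mu. \end{cases}$$

Function sequences of this kind are introduced by Brown and Hwang (1982)
and have been used to prove the admissibility of various generalized Bayes
estimators.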
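
The key property of this truncation sequence is that $h_l$ tends to 1 pointwise while a $\mu$-weighted integral of its squared derivative vanishes like $1/\log l$. The sketch below is an illustration only; the names `h`, `h_prime` and `weighted_energy` are hypothetical, and the weight $\mu$ corresponds to the simplest case of the weighted integrals that appear later in the proof.

```python
import math


def h(mu, l):
    """Truncation function h_l(mu): 1 on [0, 1], logarithmic decay on (1, l], 0 beyond l."""
    if mu <= 1.0:
        return 1.0
    if mu <= l:
        return 1.0 - math.log(mu) / math.log(l)
    return 0.0


def h_prime(mu, l):
    """Derivative of h_l: -1 / (mu * log l) on (1, l], and 0 elsewhere."""
    if 1.0 < mu <= l:
        return -1.0 / (mu * math.log(l))
    return 0.0


def weighted_energy(l, steps=10000):
    """Midpoint rule in s = log(mu) for the integral of mu * h_l'(mu)^2 over 1 < mu <= l."""
    ds = math.log(l) / steps
    total = 0.0
    for k in range(steps):
        mu = math.exp((k + 0.5) * ds)
        # Change of variables: mu * h'(mu)^2 dmu = (mu * h'(mu))^2 ds.
        total += (mu * h_prime(mu, l)) ** 2 * ds
    return total
```

The integral equals $1/\log l$ exactly, so it decays only logarithmically; this slow decay is what allows the priors $\pi^{[l]}_{\alpha,\beta}$ to approach the improper prior without inflating the Bayes risk difference.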

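First, we see that the Bayesian predictive distribution

$$p_{\pi^{[l]}_{\alpha,\beta}}(y|x) = \frac{\int p(y|\lambda)\, p(x|\lambda)\, \pi^{[l]}_{\alpha,\beta}(\lambda)\, d\lambda}{\int p(x|\lambda)\, \pi^{[l]}_{\alpha,\beta}(\lambda)\, d\lambda}$$

based on the prior $\pi^{[l]}_{\alpha,\beta}(\lambda)$ minimizes the Bayes risk under $\pi^{[l]}_{\alpha,\beta}(\lambda)$ by using
Aitchison's discussion [Aitchison (1975), page 549].

The Bayes risk of a predictive distribution $\hat{p}(y; x)$ is given by

$$\int \pi^{[l]}_{\alpha,\beta}(\lambda) \sum_{x} p(x|\lambda) \sum_{y} p(y|\lambda) \log \frac{p(y|\lambda)}{\hat{p}(y; x)}\, d\lambda$$
$$= \int \pi^{[l]}_{\alpha,\beta}(\lambda) \sum_{x} p(x|\lambda) \sum_{y} p(y|\lambda) \log p(y|\lambda)\, d\lambda - \int \pi^{[l]}_{\alpha,\beta}(\lambda) \sum_{x} p(x|\lambda) \sum_{y} p(y|\lambda) \log \hat{p}(y; x)\, d\lambda. \tag{12}$$

The first term on the right-hand side of (12) is finite and does not depend
on $\hat{p}(y; x)$. The second term on the right-hand side of (12) is

$$-\int \pi^{[l]}_{\alpha,\beta}(\lambda) \sum_{x} \sum_{y} p(x|\lambda)\, p(y|\lambda) \log \hat{p}(y; x)\, d\lambda = -\sum_{x} \sum_{y} p_{\pi^{[l]}_{\alpha,\beta}}(x, y) \log \hat{p}(y; x)$$
$$= -\sum_{x} p_{\pi^{[l]}_{\alpha,\beta}}(x) \sum_{y} p_{\pi^{[l]}_{\alpha,\beta}}(y|x) \log \hat{p}(y; x),$$

where

$$p_{\pi^{[l]}_{\alpha,\beta}}(x, y) = \int p(x|\lambda)\, p(y|\lambda)\, \pi^{[l]}_{\alpha,\beta}(\lambda)\, d\lambda$$

and

$$p_{\pi^{[l]}_{\alpha,\beta}}(x) = \int p(x|\lambda)\, \pi^{[l]}_{\alpha,\beta}(\lambda)\, d\lambda.$$

This is minimized when $\hat{p}(y; x) = p_{\pi^{[l]}_{\alpha,\beta}}(y|x)$. Thus, $p_{\pi^{[l]}_{\alpha,\beta}}(y|x)$ minimizes
the Bayes risk (12).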

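Therefore, it suffices to show that

$$\int \pi^{[l]}_{\alpha,\beta}(\lambda) \{E[D(p(y|\lambda), p_{\pi_{\alpha,\beta}}(y|x)) \mid \lambda] - E[D(p(y|\lambda), p_{\pi^{[l]}_{\alpha,\beta}}(y|x)) \mid \lambda]\}\, d\lambda \to 0 \quad \text{as } l \to \infty \tag{13}$$

to prove the admissibility of the Bayesian predictive distributions based on
the priors in $\{\pi_{\alpha,\beta}(\lambda) : 0 < -\alpha + \sum_i \beta_i \le 1,\ \beta_i > 0,\ i = 1, 2, \ldots, d\}$ because the
risks of the Bayesian predictive distributions with the priors in the proposed
class are finite for all values of $\lambda$ and Theorem 2 holds.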
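Now we obtain a convenient expression for the integral

$$\int \pi^{[l]}_{\alpha,\beta}(\lambda) \{E[D(p(y|\lambda), p_{\pi_{\alpha,\beta}}(y|x)) \mid \lambda] - E[D(p(y|\lambda), p_{\pi^{[l]}_{\alpha,\beta}}(y|x)) \mid \lambda]\}\, d\lambda.$$

Let

$$\pi_{\alpha,\beta}(\mu) := \Bigl(\frac{1}{\mu}\Bigr)^{\alpha - \sum_i \beta_i + 1}, \qquad \pi^{[l]}_{\alpha,\beta}(\mu) := \Bigl(\frac{1}{\mu}\Bigr)^{\alpha - \sum_i \beta_i + 1} h_l^2(\mu)$$

and

$$\pi_{\alpha,\beta}(w_1, w_2, \ldots, w_{d-1}) := \pi^{[l]}_{\alpha,\beta}(w_1, w_2, \ldots, w_{d-1}) := w_1^{\beta_1 - 1} w_2^{\beta_2 - 1} \cdots w_{d-1}^{\beta_{d-1} - 1} \Bigl(1 - \sum_{i=1}^{d-1} w_i\Bigr)^{\beta_d - 1}. \tag{14}$$

Then we have

$$\pi_{\alpha,\beta}(\lambda)\, d\lambda_1\, d\lambda_2 \cdots d\lambda_d = \pi_{\alpha,\beta}(\mu, w_1, \ldots, w_{d-1})\, d\mu\, dw_1 \cdots dw_{d-1} = \pi_{\alpha,\beta}(\mu)\, \pi_{\alpha,\beta}(w_1, \ldots, w_{d-1})\, d\mu\, dw_1 \cdots dw_{d-1} \tag{15}$$

and

$$\pi^{[l]}_{\alpha,\beta}(\lambda)\, d\lambda_1\, d\lambda_2 \cdots d\lambda_d = \pi^{[l]}_{\alpha,\beta}(\mu, w_1, \ldots, w_{d-1})\, d\mu\, dw_1 \cdots dw_{d-1} = \pi^{[l]}_{\alpha,\beta}(\mu)\, \pi^{[l]}_{\alpha,\beta}(w_1, \ldots, w_{d-1})\, d\mu\, dw_1 \cdots dw_{d-1}. \tag{16}$$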

Let $\bar{x} = x_1 + x_2 + \cdots + x_d$ and $\bar{y} = y_1 + y_2 + \cdots + y_d$. If a prior $\pi(\mu, w_1, w_2, \ldots, w_{d-1})$
has the form $\pi(\mu, w_1, w_2, \ldots, w_{d-1}) = \pi(\mu)\, \pi(w_1, w_2, \ldots, w_{d-1})$,
then the relation

$$p_\pi(\mu, w_1, w_2, \ldots, w_{d-1} | x_1, x_2, \ldots, x_d) = \frac{p(\bar{x}, x_1, x_2, \ldots, x_{d-1} | \mu, w_1, w_2, \ldots, w_{d-1})\, \pi(\mu, w_1, w_2, \ldots, w_{d-1})}{\int p(\bar{x}, x_1, x_2, \ldots, x_{d-1} | \mu, w_1, w_2, \ldots, w_{d-1})\, \pi(\mu, w_1, w_2, \ldots, w_{d-1})\, d\mu\, dw_1\, dw_2 \cdots dw_{d-1}}$$

$$= \frac{p(\bar{x}|\mu)\, \pi(\mu)}{\int p(\bar{x}|\mu)\, \pi(\mu)\, d\mu} \times \frac{p(x_1, x_2, \ldots, x_{d-1} | \bar{x}, w_1, w_2, \ldots, w_{d-1})\, \pi(w_1, w_2, \ldots, w_{d-1})}{\int p(x_1, x_2, \ldots, x_{d-1} | \bar{x}, w_1, w_2, \ldots, w_{d-1})\, \pi(w_1, w_2, \ldots, w_{d-1})\, dw_1\, dw_2 \cdots dw_{d-1}} \tag{17}$$

holds, because

$$p(x_1, x_2, \ldots, x_d | \lambda_1, \lambda_2, \ldots, \lambda_d) = p(\bar{x}, x_1, x_2, \ldots, x_{d-1} | \mu, w_1, w_2, \ldots, w_{d-1})$$
$$= p(\bar{x} | \mu, w_1, w_2, \ldots, w_{d-1})\, p(x_1, x_2, \ldots, x_{d-1} | \bar{x}, \mu, w_1, w_2, \ldots, w_{d-1})$$
$$= p(\bar{x}|\mu)\, p(x_1, x_2, \ldots, x_{d-1} | \bar{x}, w_1, w_2, \ldots, w_{d-1}).$$

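From the relations (14)-(17), and

$$p(y_1, y_2, \ldots, y_d | \lambda_1, \lambda_2, \ldots, \lambda_d) = p(\bar{y}|\mu)\, p(y_1, y_2, \ldots, y_{d-1} | \bar{y}, w_1, w_2, \ldots, w_{d-1}),$$

where $\bar{x} = \sum_i x_i$ and $\bar{y} = \sum_i y_i$ denote the totals, it follows that the difference of the Kullback-Leibler losses for
$p_{\pi_{\alpha,\beta}}(y_1, y_2, \ldots, y_d | x_1, x_2, \ldots, x_d)$ and $p_{\pi^{[l]}_{\alpha,\beta}}(y_1, y_2, \ldots, y_d | x_1, x_2, \ldots, x_d)$ is given by

$$D(p(y|\lambda), p_{\pi_{\alpha,\beta}}(y|x)) - D(p(y|\lambda), p_{\pi^{[l]}_{\alpha,\beta}}(y|x))$$

$$= \sum_{y_1} \sum_{y_2} \cdots \sum_{y_d} p(y_1, \ldots, y_d | \lambda_1, \ldots, \lambda_d) \log \frac{p(y_1, \ldots, y_d | \lambda_1, \ldots, \lambda_d)}{p_{\pi_{\alpha,\beta}}(y_1, \ldots, y_d | x_1, \ldots, x_d)}$$
$$- \sum_{y_1} \sum_{y_2} \cdots \sum_{y_d} p(y_1, \ldots, y_d | \lambda_1, \ldots, \lambda_d) \log \frac{p(y_1, \ldots, y_d | \lambda_1, \ldots, \lambda_d)}{p_{\pi^{[l]}_{\alpha,\beta}}(y_1, \ldots, y_d | x_1, \ldots, x_d)}$$

$$= \sum_{y_1} \sum_{y_2} \cdots \sum_{y_d} p(y_1, \ldots, y_d | \lambda_1, \ldots, \lambda_d) \log \frac{p_{\pi^{[l]}_{\alpha,\beta}}(y_1, \ldots, y_d | x_1, \ldots, x_d)}{p_{\pi_{\alpha,\beta}}(y_1, \ldots, y_d | x_1, \ldots, x_d)}$$

$$= \sum_{\bar{y}} p(\bar{y}|\mu) \log \frac{p_{\pi^{[l]}_{\alpha,\beta}}(\bar{y}|\bar{x})}{p_{\pi_{\alpha,\beta}}(\bar{y}|\bar{x})}, \tag{18}$$

where

$$p_{\pi_{\alpha,\beta}}(\bar{y}|\bar{x}) = \frac{\int_0^\infty p(\bar{y}|\mu)\, p(\bar{x}|\mu)\, \pi_{\alpha,\beta}(\mu)\, d\mu}{\int_0^\infty p(\bar{x}|\mu)\, \pi_{\alpha,\beta}(\mu)\, d\mu}$$

and

$$p_{\pi^{[l]}_{\alpha,\beta}}(\bar{y}|\bar{x}) = \frac{\int_0^\infty p(\bar{y}|\mu)\, p(\bar{x}|\mu)\, \pi^{[l]}_{\alpha,\beta}(\mu)\, d\mu}{\int_0^\infty p(\bar{x}|\mu)\, \pi^{[l]}_{\alpha,\beta}(\mu)\, d\mu}.$$

The last equality in (18) holds because the two priors share the same factor in $(w_1, \ldots, w_{d-1})$, so the corresponding factors of the two predictive distributions cancel.

Since

$$\sum_{\bar{x}} \exp(-a\mu) \frac{(a\mu)^{\bar{x}}}{\bar{x}!} \sum_{v} \sum_{w} \exp(-s\mu) \frac{(s\mu)^{v}}{v!} \exp(-\tau\mu) \frac{(\tau\mu)^{w}}{w!} \log \frac{\exp\{-(s+\tau)\mu\}\, \mu^{v+w}}{\int_0^\infty \exp\{-(s+\tau)\mu\}\, \mu^{v+w}\, \pi(\mu|\bar{x})\, d\mu}$$

$$= \sum_{\bar{x}} \exp(-a\mu) \frac{(a\mu)^{\bar{x}}}{\bar{x}!} \sum_{v} \sum_{w} \exp(-s\mu) \frac{(s\mu)^{v}}{v!} \exp(-\tau\mu) \frac{(\tau\mu)^{w}}{w!}$$
$$\times \Bigl[\log \frac{\exp(-s\mu)\, \mu^{v}}{\int_0^\infty \exp(-s\mu)\, \mu^{v}\, \pi(\mu|\bar{x})\, d\mu} + \log \frac{\exp(-\tau\mu)\, \mu^{w} \int_0^\infty \exp(-s\mu)\, \mu^{v}\, \pi(\mu|\bar{x})\, d\mu}{\int_0^\infty \exp(-s\mu)\, \mu^{v} \exp(-\tau\mu)\, \mu^{w}\, \pi(\mu|\bar{x})\, d\mu}\Bigr],$$

we have

$$\frac{\partial}{\partial s} \sum_{\bar{x}} \exp(-a\mu) \frac{(a\mu)^{\bar{x}}}{\bar{x}!} \sum_{\bar{y}} \exp(-s\mu) \frac{(s\mu)^{\bar{y}}}{\bar{y}!} \log \frac{\exp(-s\mu)(s\mu)^{\bar{y}}/(\bar{y}!)}{\int_0^\infty \exp(-s\mu)(s\mu)^{\bar{y}}/(\bar{y}!)\, \pi(\mu|\bar{x})\, d\mu}$$

$$= \lim_{\tau \to 0} \frac{1}{\tau} \sum_{\bar{x}} \sum_{v} \sum_{w} \exp(-a\mu) \frac{(a\mu)^{\bar{x}}}{\bar{x}!} \exp(-s\mu) \frac{(s\mu)^{v}}{v!} \exp(-\tau\mu) \frac{(\tau\mu)^{w}}{w!} \log \frac{\exp(-\tau\mu)\, \mu^{w} \int_0^\infty \exp(-s\mu)\, \mu^{v}\, \pi(\mu|\bar{x})\, d\mu}{\int_0^\infty \exp(-s\mu)\, \mu^{v} \exp(-\tau\mu)\, \mu^{w}\, \pi(\mu|\bar{x})\, d\mu}$$

$$= \sum_{z} \exp(-t\mu) \frac{(t\mu)^{z}}{z!} \Bigl(\hat\mu_t - \mu - \mu \log \frac{\hat\mu_t}{\mu}\Bigr),$$

where $t := a + s$, $z := \bar{x} + v$ and

$$\hat\mu_t := \frac{\int_0^\infty \mu \exp(-s\mu)(s\mu)^{v}/(v!)\, \pi(\mu|\bar{x})\, d\mu}{\int_0^\infty \exp(-s\mu)(s\mu)^{v}/(v!)\, \pi(\mu|\bar{x})\, d\mu} = \frac{\int_0^\infty \mu \exp(-t\mu)(t\mu)^{z}/(z!)\, \pi(\mu)\, d\mu}{\int_0^\infty \exp(-t\mu)(t\mu)^{z}/(z!)\, \pi(\mu)\, d\mu}.$$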
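We put $c = \alpha - \sum_i \beta_i + 1$ $(0 \le c < 1)$ and $g_l(\mu) = (1/2) h_l^2(\mu)$. Then the
expected divergence from $p(\bar{y}|\mu)$ to $p_{\pi^{[l]}_{\alpha,\beta}}(\bar{y}|\bar{x})$, where $\bar{x}$ and $\bar{y}$ denote the totals of $x$ and $y$, is expressed by

$$E[D(p(\bar{y}|\mu), p_{\pi^{[l]}_{\alpha,\beta}}(\bar{y}|\bar{x})) \mid \mu] = \sum_{\bar{x}} \exp(-a\mu) \frac{(a\mu)^{\bar{x}}}{\bar{x}!} \sum_{\bar{y}} \exp(-b\mu) \frac{(b\mu)^{\bar{y}}}{\bar{y}!} \log \frac{\exp(-b\mu)(b\mu)^{\bar{y}}/(\bar{y}!)}{\int \exp(-b\mu)(b\mu)^{\bar{y}}/(\bar{y}!)\, p_{\pi^{[l]}_{\alpha,\beta}}(\mu|\bar{x})\, d\mu}$$
$$= \int_a^{a+b} \sum_{z} \exp(-t\mu) \frac{(t\mu)^{z}}{z!} \Bigl(\hat\mu_{l,t} - \mu - \mu \log \frac{\hat\mu_{l,t}}{\mu}\Bigr)\, dt, \tag{19}$$

where

$$\hat\mu_{l,t} = \frac{\int_0^\infty \mu \exp(-t\mu)(t\mu)^{z}/(z!)\, \mu^{-c} g_l(\mu)\, d\mu}{\int_0^\infty \exp(-t\mu)(t\mu)^{z}/(z!)\, \mu^{-c} g_l(\mu)\, d\mu} = \frac{1}{t} \frac{\int_0^\infty \exp(-t\mu)(t\mu)^{z+1-c} g_l(\mu)\, d\mu}{\int_0^\infty \exp(-t\mu)(t\mu)^{z-c} g_l(\mu)\, d\mu}$$

$$= \frac{\bigl[-t^{-1} \exp(-t\mu)(t\mu)^{z+1-c} g_l(\mu)\bigr]_0^\infty + \int_0^\infty t^{-1} \exp(-t\mu) \{(z+1-c)\, t\, (t\mu)^{z-c} g_l(\mu) + (t\mu)^{z+1-c} g_l'(\mu)\}\, d\mu}{t \int_0^\infty \exp(-t\mu)(t\mu)^{z-c} g_l(\mu)\, d\mu}$$

$$= \frac{z+1-c}{t} + \frac{1}{t^2} \frac{\int_0^\infty \exp(-t\mu)(t\mu)^{z+1-c} g_l'(\mu)\, d\mu}{\int_0^\infty \exp(-t\mu)(t\mu)^{z-c} g_l(\mu)\, d\mu}. \tag{20}$$

In the same way, the expected divergence from $p(\bar{y}|\mu)$ to

$$p_{\pi_{\alpha,\beta}}(\bar{y}|\bar{x}) = \Bigl(\frac{a}{a+b}\Bigr)^{\bar{x}+1-c} \Bigl(\frac{b}{a+b}\Bigr)^{\bar{y}} \frac{\Gamma(\bar{x} + \bar{y} - c + 1)}{\Gamma(\bar{x} - c + 1)\, \bar{y}!}$$

is expressed by

$$E[D(p(\bar{y}|\mu), p_{\pi_{\alpha,\beta}}(\bar{y}|\bar{x})) \mid \mu] = \sum_{\bar{x}} \exp(-a\mu) \frac{(a\mu)^{\bar{x}}}{\bar{x}!} \sum_{\bar{y}} \exp(-b\mu) \frac{(b\mu)^{\bar{y}}}{\bar{y}!} \log \frac{\exp(-b\mu)(b\mu)^{\bar{y}}/(\bar{y}!)}{p_{\pi_{\alpha,\beta}}(\bar{y}|\bar{x})}$$
$$= \int_a^{a+b} \sum_{z} \exp(-t\mu) \frac{(t\mu)^{z}}{z!} \Bigl(\frac{z+1-c}{t} - \mu - \mu \log \frac{z+1-c}{t\mu}\Bigr)\, dt. \tag{21}$$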

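From (18), (19) and (21), we obtain the following expression for the integral
in (13):

$$\int \pi^{[l]}_{\alpha,\beta}(\lambda) \{E[D(p(y|\lambda), p_{\pi_{\alpha,\beta}}(y|x)) \mid \lambda] - E[D(p(y|\lambda), p_{\pi^{[l]}_{\alpha,\beta}}(y|x)) \mid \lambda]\}\, d\lambda$$

$$= C \int_0^\infty \pi^{[l]}_{\alpha,\beta}(\mu) \{E[D(p(\bar{y}|\mu), p_{\pi_{\alpha,\beta}}(\bar{y}|\bar{x})) \mid \mu] - E[D(p(\bar{y}|\mu), p_{\pi^{[l]}_{\alpha,\beta}}(\bar{y}|\bar{x})) \mid \mu]\}\, d\mu$$

$$= C \int_a^{a+b} \Bigl\{ \int_0^\infty \pi^{[l]}_{\alpha,\beta}(\mu) \sum_{z} \exp(-t\mu) \frac{(t\mu)^{z}}{z!} \Bigl(\frac{z+1-c}{t} - \hat\mu_{l,t} - \mu \log \frac{z+1-c}{t \hat\mu_{l,t}}\Bigr)\, d\mu \Bigr\}\, dt, \tag{22}$$

where $C = \int \pi_{\alpha,\beta}(w_1, \ldots, w_{d-1})\, dw_1 \cdots dw_{d-1} = \prod_i \Gamma(\beta_i) / \Gamma(\sum_i \beta_i)$ is a finite constant that does not depend on $l$.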



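We show (13) by evaluating (22) using the following inequalities, (23) and
(24), similar to the inequalities used by Ghosh and Yang (1988) to prove the
admissibility of a class of linear estimators of Poisson means of the form
$\hat\lambda_i = c_i x_i + b_i$ under the Kullback-Leibler loss.

We have

$$\int_0^\infty g_l(\mu)\, \mu^{-c} \sum_{z=0}^{\infty} \exp(-t\mu) \frac{(t\mu)^{z}}{z!} \Bigl(\frac{z+1-c}{t} - \hat\mu_{l,t} - \mu \log \frac{z+1-c}{t \hat\mu_{l,t}}\Bigr)\, d\mu$$

$$\le \int_0^\infty g_l(\mu)\, \mu^{-c} \sum_{z=0}^{\infty} \exp(-t\mu) \frac{(t\mu)^{z}}{z!} \Bigl(\frac{z+1-c}{t} - \hat\mu_{l,t} + \mu\, \frac{t \hat\mu_{l,t} - (z+1-c)}{z+1-c}\Bigr)\, d\mu$$

$$= \sum_{z=0}^{\infty} \frac{t^{c-1}}{z!} \Bigl\{ \int_0^\infty \exp(-t\mu)\, g_l(\mu) \frac{(t\mu)^{z+1-c}}{z+1-c}\, d\mu - \int_0^\infty \exp(-t\mu)\, g_l(\mu)(t\mu)^{z-c}\, d\mu \Bigr\} \{t \hat\mu_{l,t} - (z+1-c)\} \tag{23}$$

$$= \sum_{z=0}^{\infty} \frac{t^{c-1}}{z!} \Bigl\{ \Bigl[-t^{-1} \exp(-t\mu)\, g_l(\mu) \frac{(t\mu)^{z+1-c}}{z+1-c}\Bigr]_0^\infty + \int_0^\infty t^{-1} \exp(-t\mu) \Bigl[g_l'(\mu) \frac{(t\mu)^{z+1-c}}{z+1-c} + g_l(\mu)\, t\, (t\mu)^{z-c}\Bigr] d\mu$$
$$- \int_0^\infty \exp(-t\mu)\, g_l(\mu)(t\mu)^{z-c}\, d\mu \Bigr\} \{t \hat\mu_{l,t} - (z+1-c)\}$$

$$= \sum_{z=0}^{\infty} \frac{t^{c-2}}{z!} \Bigl\{ \int_0^\infty \exp(-t\mu)\, g_l'(\mu) \frac{(t\mu)^{z+1-c}}{z+1-c}\, d\mu \Bigr\} \{t \hat\mu_{l,t} - (z+1-c)\}.$$

The inequality follows from $\log u \ge 1 - 1/u$ for $u > 0$, and the last two equalities follow from integration by parts.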

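By using (20) and the inequality

$$\frac{z+1}{z+1-c} \le \frac{1}{1-c},$$

where $0 \le c < 1$, we have

$$\int_0^\infty g_l(\mu)\, \mu^{-c} \sum_{z=0}^{\infty} \exp(-t\mu) \frac{(t\mu)^{z}}{z!} \Bigl(\frac{z+1-c}{t} - \hat\mu_{l,t} - \mu \log \frac{z+1-c}{t \hat\mu_{l,t}}\Bigr)\, d\mu$$

$$\le \frac{t^{c-3}}{1-c} \sum_{z=0}^{\infty} \frac{1}{(z+1)!} \frac{\{\int_0^\infty \exp(-t\mu)(t\mu)^{z+1-c} g_l'(\mu)\, d\mu\}^2}{\int_0^\infty \exp(-t\mu)(t\mu)^{z-c} g_l(\mu)\, d\mu}$$

$$= \frac{t^{c-3}}{1-c} \sum_{z=0}^{\infty} \frac{2}{(z+1)!} \frac{\{\int_0^\infty \exp(-t\mu)(t\mu)^{z+1-c} h_l(\mu)\, h_l'(\mu)\, d\mu\}^2}{\int_0^\infty \exp(-t\mu)(t\mu)^{z-c} h_l^2(\mu)\, d\mu}$$

$$\le \frac{2 t^{c-3}}{1-c} \sum_{z=0}^{\infty} \frac{1}{(z+1)!} \frac{\int_0^\infty \exp(-t\mu)(t\mu)^{z+2-c} (h_l'(\mu))^2\, d\mu}{\int_0^\infty \exp(-t\mu)(t\mu)^{z-c} h_l^2(\mu)\, d\mu} \times \int_0^\infty \exp(-t\mu)(t\mu)^{z-c} h_l^2(\mu)\, d\mu$$

$$= \frac{2 t^{c-3}}{1-c} \sum_{z=0}^{\infty} \frac{1}{(z+1)!} \int_0^\infty \exp(-t\mu)(t\mu)^{z+2-c} (h_l'(\mu))^2\, d\mu$$

$$= \frac{2 t^{c-3}}{1-c} \int_0^\infty \{1 - \exp(-t\mu)\} (t\mu)^{1-c} (h_l'(\mu))^2\, d\mu. \tag{24}$$

The second inequality is the Cauchy-Schwarz inequality. The derivative of $h_l(\mu)$ is given by

$$h_l'(\mu) = \begin{cases} 0, & \text{if } 0 \le \mu \le 1, \\ -\dfrac{1}{\mu \log l}, & \text{if } 1 < \mu \le l, \\ 0, & \text{if } l < \mu. \end{cases} \tag{25}$$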
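From (22), (24) and (25), and since $\pi^{[l]}_{\alpha,\beta}(\mu) = 2 \mu^{-c} g_l(\mu)$, we have

$$\int \pi^{[l]}_{\alpha,\beta}(\lambda) \{E[D(p(y|\lambda), p_{\pi_{\alpha,\beta}}(y|x)) \mid \lambda] - E[D(p(y|\lambda), p_{\pi^{[l]}_{\alpha,\beta}}(y|x)) \mid \lambda]\}\, d\lambda$$

$$\le C \int_a^{a+b} \frac{4 t^{c-3}}{1-c} \int_1^l \{1 - \exp(-t\mu)\} \frac{(t\mu)^{1-c}}{\mu^2 (\log l)^2}\, d\mu\, dt$$

$$\le C \int_a^{a+b} \frac{4}{(1-c)\, t^2} \int_1^l \frac{1}{\mu (\log l)^2}\, d\mu\, dt = C \int_a^{a+b} \frac{4}{(1-c)\, t^2 \log l}\, dt \to 0$$

as $l \to \infty$, where $C$ denotes the finite constant $\int \pi_{\alpha,\beta}(w_1, \ldots, w_{d-1})\, dw_1 \cdots dw_{d-1}$ arising in (22).

We have thus proved the theorem. □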



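Proof of Theorem 5. From Lemma 1, the generalized Bayes estimator
of $\lambda$ with respect to $\pi_S(\lambda)$ is given by

$$\hat\lambda_i = \frac{\int \lambda_i\, \pi_S(\lambda) \prod_j \{\exp(-a\lambda_j)(a\lambda_j)^{x_j}/(x_j!)\}\, d\lambda}{\int \pi_S(\lambda) \prod_j \{\exp(-a\lambda_j)(a\lambda_j)^{x_j}/(x_j!)\}\, d\lambda}$$

$$= \frac{a^{-2}\, [\Gamma(\sum_k x_k + 2) / \Gamma(\sum_k x_k + d/2 + 1)]\, \prod_{j \ne i} \Gamma(x_j + 1/2)\; \Gamma(x_i + 3/2)}{a^{-1}\, [\Gamma(\sum_k x_k + 1) / \Gamma(\sum_k x_k + d/2)]\, \prod_j \Gamma(x_j + 1/2)}$$

$$= \frac{1}{a}\, \frac{\sum_j x_j + 1}{\sum_k x_k + d/2} \Bigl(x_i + \frac{1}{2}\Bigr).$$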
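
The closed form of the generalized Bayes estimator, $\hat\lambda_i = a^{-1} (\sum_j x_j + 1)(x_i + 1/2) / (\sum_k x_k + d/2)$, takes one line to compute. The sketch below is only an illustration; the function name `generalized_bayes_estimate` is hypothetical.

```python
def generalized_bayes_estimate(x, a):
    """Generalized Bayes estimate of the Poisson means under the shrinkage prior pi_S."""
    d = len(x)
    total = sum(x)
    # Common shrinkage factor applied to the Jeffreys-type estimate (x_i + 1/2) / a.
    shrink = (total + 1.0) / (total + d / 2.0)
    return [shrink * (xi + 0.5) / a for xi in x]
```

Two features are visible directly: the estimated total $\sum_i \hat\lambda_i$ equals $(\sum_i x_i + 1)/a$ whatever the dimension, and each coordinate is the estimate $(x_i + 1/2)/a$ multiplied by a common factor that is smaller than 1 when $d \ge 3$.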
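The plug-in distribution with $\hat\lambda(x)$ is given by

$$p(y|\hat\lambda) = p(y|\hat\mu, \hat{w}) = p(\bar{y}|\hat\mu)\, p(y|\bar{y}, \hat{w}),$$

where $\bar{x} = \sum_i x_i$ and $\bar{y} = \sum_i y_i$ denote the totals,

$$\hat\mu = \sum_i \hat\lambda_i = \frac{\bar{x} + 1}{a}, \qquad \hat{w}_i = \frac{\hat\lambda_i}{\hat\mu} = \frac{x_i + 1/2}{\bar{x} + d/2},$$

$$p(\bar{y}|\hat\mu) = \exp(-b\hat\mu) \frac{(b\hat\mu)^{\bar{y}}}{\bar{y}!} \quad \text{and} \quad p(y|\bar{y}, \hat{w}) = \binom{\bar{y}}{y_1\, y_2 \cdots y_d} \hat{w}_1^{y_1} \hat{w}_2^{y_2} \cdots \hat{w}_d^{y_d}.$$

We show that the predictive distribution $p_{\pi_S}(\bar{y}|\bar{x})\, p(y|\bar{y}, \hat{w})$ dominates the
plug-in distribution $p(y|\hat\lambda)$. The difference between the risk of the plug-in
distribution $p(y|\hat\lambda)$ and that of $p_{\pi_S}(\bar{y}|\bar{x})\, p(y|\bar{y}, \hat{w})$ is given by

$$E[D(p(y|\lambda), p(y|\hat\lambda)) \mid \lambda] - E[D(p(y|\lambda), p_{\pi_S}(\bar{y}|\bar{x})\, p(y|\bar{y}, \hat{w})) \mid \lambda]$$
$$= E[D(p(\bar{y}|\mu), p(\bar{y}|\hat\mu)) \mid \mu] - E[D(p(\bar{y}|\mu), p_{\pi_S}(\bar{y}|\bar{x})) \mid \mu].$$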
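From (21), the expected Kullback-Leibler divergence from $p(\bar{y}|\mu)$ to $p_{\pi_S}(\bar{y}|\bar{x})$, where $\bar{x}$ and $\bar{y}$ denote the totals of $x$ and $y$,
is

$$E[D(p(\bar{y}|\mu), p_{\pi_S}(\bar{y}|\bar{x})) \mid \lambda] = \int_a^{a+b} \sum_{z=0}^{\infty} \exp(-t\mu) \frac{(t\mu)^{z}}{z!} \Bigl(\frac{z+1}{t} - \mu - \mu \log \frac{z+1}{t\mu}\Bigr)\, dt. \tag{26}$$

Since the Kullback-Leibler divergence from $p(\bar{y}|\mu)$ to $p(\bar{y}|\hat\mu)$ is given by

$$D(p(\bar{y}|\mu), p(\bar{y}|\hat\mu)) = b \Bigl(\hat\mu - \mu - \mu \log \frac{\hat\mu}{\mu}\Bigr),$$

we have

$$E[D(p(\bar{y}|\mu), p(\bar{y}|\hat\mu)) \mid \lambda] = b \sum_{\bar{x}=0}^{\infty} \exp(-a\mu) \frac{(a\mu)^{\bar{x}}}{\bar{x}!} \Bigl(\frac{\bar{x}+1}{a} - \mu - \mu \log \frac{\bar{x}+1}{a\mu}\Bigr). \tag{27}$$

Note that the integrand of (26) coincides with (27) multiplied by $1/b$
when $t = a$. Hence, to prove the theorem, it suffices to show that the integrand
of (26) is a decreasing function of $t$ for all values of $\mu$.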

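The derivative of the integrand of (26) is given by

$$\frac{\partial}{\partial t} \sum_{z=0}^{\infty} \exp(-t\mu) \frac{(t\mu)^{z}}{z!} \Bigl(\frac{z+1}{t} - \mu - \mu \log \frac{z+1}{t\mu}\Bigr)$$

$$= \mu^2 \Bigl\{ -\frac{1}{t^2 \mu^2} + \frac{1}{t\mu} + \sum_{z=0}^{\infty} \exp(-t\mu) \frac{(t\mu)^{z}}{z!} \log(z+1) - \sum_{z=1}^{\infty} \exp(-t\mu) \frac{(t\mu)^{z-1}}{(z-1)!} \log(z+1) \Bigr\}.$$

Since

$$\frac{1}{t^2 \mu^2} = \frac{1}{t^2 \mu^2} \Bigl\{ \exp(-t\mu) + t\mu \exp(-t\mu) + \sum_{z=0}^{\infty} \exp(-t\mu) \frac{(t\mu)^{z+2}}{(z+2)!} \Bigr\}$$
$$= \frac{1}{t^2 \mu^2} \{\exp(-t\mu) + t\mu \exp(-t\mu)\} + \sum_{z=0}^{\infty} \exp(-t\mu) \frac{(t\mu)^{z}}{z!} \Bigl(\frac{1}{z+1} - \frac{1}{z+2}\Bigr),$$

$$\frac{1}{t\mu} = \frac{1}{t\mu} \exp(-t\mu) + \sum_{z=0}^{\infty} \exp(-t\mu) \frac{(t\mu)^{z}}{z!} \frac{1}{z+1}$$

and

$$\sum_{z=0}^{\infty} \exp(-t\mu) \frac{(t\mu)^{z}}{z!} \log(z+1) - \sum_{z=1}^{\infty} \exp(-t\mu) \frac{(t\mu)^{z-1}}{(z-1)!} \log(z+1)$$
$$= -\sum_{z=0}^{\infty} \exp(-t\mu) \frac{(t\mu)^{z}}{z!} \{\log(z+2) - \log(z+1)\} < -\sum_{z=0}^{\infty} \exp(-t\mu) \frac{(t\mu)^{z}}{z!} \frac{1}{z+2},$$

we have

$$\frac{\partial}{\partial t} \sum_{z=0}^{\infty} \exp(-t\mu) \frac{(t\mu)^{z}}{z!} \Bigl(\frac{z+1}{t} - \mu - \mu \log \frac{z+1}{t\mu}\Bigr) < -\frac{1}{t^2} \exp(-t\mu) < 0.$$

Thus, the integrand of (26) is a strictly decreasing function of $t$. Therefore,
(26) is smaller than (27) for all values of $\lambda$. □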
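
The monotonicity argument can be checked numerically for the total-count part of the risk. The sketch below is a hedged illustration: the function names are hypothetical, the Poisson sums are truncated, and the integral over $t$ in (26) is approximated by a midpoint rule, so the computed values are approximations rather than exact risks.

```python
import math


def poisson_expectation(f, mean):
    """Approximate E[f(Z)] for Z ~ Poisson(mean) by a truncated sum."""
    k_max = int(mean + 12.0 * math.sqrt(mean) + 40.0)
    pmf = math.exp(-mean)
    total = 0.0
    for k in range(k_max + 1):
        total += pmf * f(k)
        pmf *= mean / (k + 1)
    return total


def integrand_26(t, mu):
    """Integrand of (26): E[(Z+1)/t - mu - mu*log((Z+1)/(t*mu))] with Z ~ Poisson(t*mu)."""
    term = lambda z: (z + 1) / t - mu - mu * math.log((z + 1) / (t * mu))
    return poisson_expectation(term, t * mu)


def shrinkage_total_risk(mu, a, b, steps=400):
    """Midpoint-rule approximation of (26), the risk of the Bayesian predictive total."""
    width = b / steps
    return sum(integrand_26(a + (k + 0.5) * width, mu) for k in range(steps)) * width


def plugin_total_risk(mu, a, b):
    """Expression (27): b times the integrand of (26) evaluated at t = a."""
    return b * integrand_26(a, mu)
```

Since the integrand decreases in `t`, its integral over an interval of length `b` starting at `a` is smaller than `b` times its value at `t = a`, which is the comparison between (26) and (27).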